Most of the success of the last few years has been driven by the rapid growth
of deep learning, and more efficient tools capable of learning semantic, high-level,
and deeper image features have been proposed. In this article, we investigate the task
of pedestrian detection on roads using models based on convolutional neural 

1. INTRODUCTION

The increasing number of vehicles during this century has made road accidents a major cause of death. 
Traffic accidents in Morocco cause more than 4000 deaths each year, of which 25% are pedestrians. Both the scientific
community and the automobile industry have contributed to the development of various types of protection 
systems to improve the vehicle’s safety and environmental performance. At the moment, the main goal in 
this field is to provide drivers with information about their environment and any potential dangers. Two of the
most useful pieces of information are the detection and location of pedestrians in front of a vehicle. Traditional
object detection techniques are based on handcrafted features such as integral channel features (ICF)
[1], [2], scale-invariant feature transform (SIFT) [3], histogram of oriented gradients (HOG) [4], local binary
patterns (LBP) [5], general forward-backward (GFB) [6], and their variations [7]-[9] and combinations [10],
[11], followed by a trainable classifier such as support vector machines (SVM) [7], [12], boosted classifiers
[13], or random forests [14]. Their performance can easily degrade when complex ensembles are constructed
that combine numerous low-level features with high-level context from object detectors and scene classifiers.
With the rapid progress of deep learning technology, more effective tools capable of learning semantic, high- 
level, and deeper features are being introduced to address the issues in traditional architectures. 

Deep convolutional neural networks (DCNN) [15]-[19] provide this ability with high performance
for various computer vision applications. Our study focuses on detecting pedestrians in individual monocular
images using state-of-the-art object detection approaches based on neural networks. The rest of this paper
is organized as follows: section 2 describes the object detection models we address. After presenting and
discussing our testing results in section 3, we conclude in section 4.


2. OBJECT DETECTION MODELS 
2.1. Faster R-CNN 

Faster region-based convolutional neural network (Faster R-CNN), proposed by Ren et al., runs at 7 FPS on an
Nvidia Titan X graphics card. It employs a separate network, known as the region proposal network (RPN), to
identify region proposals. The predicted regions are then reshaped to a fixed size by a region of interest (ROI)
pooling layer. Faster R-CNN then classifies the image content within each proposed region and predicts the
bounding box offset values. Its structure is illustrated in Figure 1.

Figure 1. Faster R-CNN architecture
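
To make the notion of bounding box offsets in section 2.1 concrete, the following minimal sketch applies a set of predicted offsets to a reference box. It assumes the centre and log-scale parameterization commonly associated with the R-CNN family; the text does not state which parameterization the tested model uses, and the function name `decode_box` and all numeric values are illustrative only.

```python
import math


def decode_box(anchor, offsets):
    """Apply offsets (tx, ty, tw, th) to a reference box (cx, cy, w, h).

    Assumed parameterization (not taken from the paper):
    the centre moves by a fraction of the reference size and the
    width/height are scaled in log space.
    """
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = ax + tx * aw           # shift centre horizontally
    cy = ay + ty * ah           # shift centre vertically
    w = aw * math.exp(tw)       # rescale width
    h = ah * math.exp(th)       # rescale height
    # Return corner format: (x_min, y_min, x_max, y_max).
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


if __name__ == "__main__":
    reference = (100.0, 100.0, 50.0, 80.0)   # hypothetical reference box
    print(decode_box(reference, (0.0, 0.0, 0.0, 0.0)))
    print(decode_box(reference, (0.5, 0.0, 0.0, 0.0)))
```

With all offsets equal to zero the reference box is returned unchanged (in corner format), which is a convenient sanity check for this kind of decoder.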


2.2. YOLOv3 

You only look once (YOLO), proposed by Redmon and Farhadi [21], is an open-source object detection and
classification algorithm based on a convolutional neural network (CNN). In a single pass, it can tell which
objects are present in an image and where they are located. The primary benefit of this technique is that a
single neural network evaluates the entire image. Using an Nvidia Titan X graphics card, the network can process
images in real time at 45 frames per second, and a simplified version called Fast YOLO can process images at
155 frames per second, outperforming other real-time detectors. Furthermore, YOLO generates fewer false
positives in the background. The YOLO architecture consists of convolutional neural network layers,
see Figure 2. YOLOv3 begins by dividing the input image into S×S grids at three scales (S=13, S=26, and S=52),
with B bounding boxes predicted for each grid cell (B=3 for YOLOv3). Each bounding box includes several
variables: x, y, w, h, a box confidence score, and C class probabilities. The confidence score indicates the
probability of an object being present in the box and the precision of the bounding box. x and y are offsets
relative to the grid cell. The width w and height h of the bounding box are normalized by the image’s width and
height. The final output at each scale has the shape (S, S, B×(5 + C)).


Figure 2. YOLOv3 architecture 
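
As a worked example of the per-scale output structure described in section 2.2, the short sketch below computes the output shape for each grid size and the total number of predicted boxes. It assumes C=80, the number of object categories in the COCO detection task; the function name `yolo_output_shape` is illustrative and not part of any library.

```python
def yolo_output_shape(s, b=3, c=80):
    """Shape of one YOLOv3 output scale: (S, S, B * (5 + C)).

    The 5 stands for x, y, w, h and the box confidence score.
    c=80 is an assumption (COCO categories), not a value from the paper.
    """
    return (s, s, b * (5 + c))


def total_boxes(grid_sizes=(13, 26, 52), b=3):
    """Number of candidate boxes predicted over all scales."""
    return sum(s * s * b for s in grid_sizes)


if __name__ == "__main__":
    for s in (13, 26, 52):
        print(s, yolo_output_shape(s))
    print("total boxes:", total_boxes())
```

Under these assumptions each scale has 255 channels, and the three scales together produce 10647 candidate boxes per image before any filtering.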


2.3. SSD 

The single shot detector (SSD), proposed by Liu et al. [22], runs at 19 FPS on an Nvidia Titan
X graphics card. It is a feed-forward convolutional neural network consisting of a base network and an auxiliary
architecture. The main aspect of SSD is that multiscale features are gathered to detect targets. SSD’s main
function is to predict category scores and box offsets for a predefined set of default bounding boxes by applying
small convolutional filters to feature maps, followed by a non-maximum suppression step to produce the final
detections. Like YOLO, SSD needs only one shot to identify different objects in an image, but SSD allows more
aspect ratios than YOLO. As a result, it can deal with objects of various sizes. The SSD network structure is
shown in Figure 3.

Figure 3. SSD architecture
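
The non-maximum suppression step mentioned in section 2.3 can be illustrated with a minimal greedy sketch in plain Python. This is not the implementation used by SSD or by the tested models; the 0.5 overlap threshold, the function names, and the sample boxes are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In the sample data the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap and is kept.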


3. RESULTS AND DISCUSSION 

For our experiments, we used the common objects in context (COCO) database for training and
the Daimler mono pedestrian dataset for testing. Experiments were executed on a computer with an Intel Xeon
E3-1226 v3 quad-core 3.3 GHz CPU, 12 GB of RAM, Ubuntu 20 OS, Python 3.8.5, and the TensorFlow 2.4.1
deep-learning framework. Our testing dataset consists of 4097 images captured from
a vehicle at video graphics array (VGA) resolution (640x480) [25], and each image has a ground-truth
file indicating the real position of the pedestrians in the image. For model evaluation we used: i) prediction
time, the time that the model takes to predict the bounding boxes and the category class of
objects; and ii) average precision, a widely used metric for evaluating the accuracy of object detectors, which
represents the area under the precision-recall curve. In this paper, we calculated the average precision using
the Cartucho source code [27]. The experimental findings presented in Table 1 reveal several interesting points:
i) Faster R-CNN has better detection capabilities but is not suitable for real-time solutions; and ii) YOLOv3
outperforms SSD for pedestrians in terms of both detection and prediction time.
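
To illustrate average precision as the area under the precision-recall curve, the sketch below computes it from a list of detections that have already been marked as true or false positives. It uses all-point interpolation, one common convention; whether this matches the code cited as [27] in every detail is an assumption, and the function name and sample data are illustrative.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of annotated objects in the test set.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


if __name__ == "__main__":
    sample = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(sample, num_ground_truth=2))
```

Deciding whether a detection counts as a true positive (typically by an overlap threshold against the ground-truth boxes) is a separate step that this sketch takes as given.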


Table 1. Models performance 


Model                                            AP (%)    Prediction time/image (s)
Faster R-CNN neural architecture search (NAS)    39.4      21.398288
SSD ResNet50 feature pyramid network (FPN)       22.68     0.822386
YOLOv3 DarkNet53                                 31.6      0.38899
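
A per-image prediction time such as the one reported in Table 1 could be measured along the following lines. This is only a sketch: `predict` is a hypothetical callable standing in for a model's inference function, and the paper does not describe its exact timing procedure (for example, whether image loading or warm-up runs are included).

```python
import time


def mean_prediction_time(predict, images):
    """Average wall-clock time in seconds that `predict` takes per image.

    `predict` is a hypothetical inference callable; `images` is any
    non-empty sequence of inputs accepted by it.
    """
    if len(images) == 0:
        raise ValueError("at least one image is required")
    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)


if __name__ == "__main__":
    # Dummy stand-in for a detector, used only to exercise the timer.
    def dummy_predict(image):
        return sum(image)

    print(mean_prediction_time(dummy_predict, [[1, 2, 3]] * 100))
```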


4. CONCLUSION 

In this paper, various pre-trained object detection models for pedestrians were implemented
and tested. Results were compared using two performance parameters: average precision and prediction time.
YOLOv3 offers the best trade-off for pedestrian detection: it is the fastest model and more accurate than SSD,
although Faster R-CNN achieves the highest average precision. The results are very promising, but several
directions remain for our future research: first, detecting pedestrians in hard conditions (weather conditions
and night vision); second, investigating the impact of the loss function on pedestrian detection models.